@jungpark-mlir commented Oct 29, 2025

Enable Gluon kernels to express and compile warp-pipelined loops—where different warps execute staggered stages (e.g., load, compute, store)—to improve compute–memory overlap and utilization.

This is achieved through a structured, two-phase lowering pipeline:

  1. Frontend (Gluon → TritonGPU):
  • Adds a new API call: gl.amd.split_warp_pipeline(), which marks pipeline stage boundaries inside a Gluon kernel.
  • The new TritonAMDGPUWarpPipeline pass converts loops containing split points into structured scf.execute_region clusters, annotated with total_stages and lead_stages.
  2. Backend (TritonGPU → LLVM):
  • The ConvertWarpPipeline pass lowers each scf.execute_region cluster into predicated execution guarded by conditional barriers (amdgpu.cond_barrier).
  • Inserts scheduler and workgroup barriers (rocdl.sched.barrier, rocdl.s.barrier) to enforce correct cross-stage ordering and prevent instruction reordering.
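
As a rough illustration of the staggering described above, here is a toy Python model. It is not the lowering performed by either pass. It assumes each barrier crossing can be treated as one time step, that every warp group walks the stages in order, and that each later warp group starts a fixed number of steps behind the previous one. The function name `staggered_schedule` and its parameters are hypothetical; `lead_stages` only borrows the attribute name, and its meaning here is an assumption.

```python
def staggered_schedule(stage_names, num_warp_groups=2, lead_stages=1, steps=4):
    """Toy model: for each time step, list the stage every warp group runs.

    A warp group that has not started yet is reported as None.
    """
    total_stages = len(stage_names)
    timeline = []
    for t in range(steps):
        row = []
        for w in range(num_warp_groups):
            # Assumption: warp group w starts w * lead_stages steps late,
            # then cycles through the stages in program order.
            local_step = t - w * lead_stages
            if local_step < 0:
                row.append(None)
            else:
                row.append(stage_names[local_step % total_stages])
        timeline.append(row)
    return timeline


timeline = staggered_schedule(["load", "compute"])
for step, row in enumerate(timeline):
    print(step, row)
```

Once both warp groups are running in this model, they are never in the same stage at the same step, which is the compute and memory overlap the pipeline is after.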

Future work

  • Automatic partitioning frontend for Triton kernels: migrating the legacy block-pingpong scheduling and adding an entirely new partitioning pass

partitioning of the code into stages.
backup
can correctly insert fence.
update interfaces per recent changes
make it work actually
fix wrongly offset insertion
refactor loop
code cleanup
barrier should be inserted from the warp causing the dependency.
Added builtin split_warp_pipeline(), inserting the builtin
splits the code region into two pipeline clusters.
now runs on mi350
- polish conversion code
- found an important fix needed, just commented for now.
custom_lds_size = 0
amd.passes.ttgpuir.add_optimize_lds_usage(pm, options.arch, custom_lds_size)
amd.passes.ttgpuir.add_warp_pipeline_conversion(pm)
passes.common.add_canonicalizer(pm)
Collaborator

Do we really need another full canonicalization pass here? It might be better to do targeted cleanups.

Contributor Author

That's a good point.

Contributor Author

done.

Comment on lines 158 to 162
forOp->setAttr("triton.warp_pipeline.total_stages",
               b.getI32IntegerAttr(totalStages));
forOp->setAttr("triton.warp_pipeline.lead_stages",
               b.getI32IntegerAttr(1)); // TODO: make configurable

Collaborator

What do those attributes control?

Contributor Author

These don't do anything right now. I'll change them to a unit attribute that identifies a pipelined scf.for, and will revisit them once I have a more concrete idea of how to use them.

Contributor Author

Done; both are removed and replaced with .pipelined_for.

cluster.push_back(op);
}
if (!cluster.empty())
clusters.push_back(std::move(cluster));
Collaborator

Why don't we create the regions directly rather than having a pass post-process them?

Contributor Author

This basically comes from considering how to program warp-pipelining in Gluon. I first considered using a Python function to define a region, as in warp specialization, but there were some issues: scf.execute_region doesn't have block arguments, and a Gluon user doesn't fully know which values need to be yielded. Rewriting a Python function into an scf.execute_region might not be impossible, but the required analysis could be even more complicated than just defining clusters by the pipeline borders. The border-based method also prevents users from mistakenly placing operations outside the clusters when pipelining.
It also helps when we migrate the existing block-pingpong scheduling, since this pass can be used by non-Gluon passes as well. The new auto-partitioning will create regions directly and might be able to replace the others, but I'm not sure yet.
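
To make the border-based idea concrete, here is a minimal Python sketch of splitting a loop body into clusters at border markers. It is an illustration only, not the pass itself: ops are plain strings, the `SPLIT` marker stands in for the op that gl.amd.split_warp_pipeline() inserts, and `split_into_clusters` is a hypothetical helper. Dropping empty clusters at a border is an assumption; the quoted snippet only shows that check for the trailing cluster.

```python
SPLIT = "split_warp_pipeline"  # stand-in for the border op (illustrative only)


def split_into_clusters(ops, border=SPLIT):
    """Group consecutive ops into clusters, cutting at every border marker."""
    clusters, cluster = [], []
    for op in ops:
        if op == border:
            # A border closes the current cluster.
            # Assumption: empty clusters are skipped.
            if cluster:
                clusters.append(cluster)
            cluster = []
        else:
            cluster.append(op)
    # Ops after the last border form the final cluster.
    if cluster:
        clusters.append(cluster)
    return clusters


body = ["global_load", "local_store", SPLIT, "local_load", "dot"]
clusters = split_into_clusters(body)
print(clusters)
```

Because every non-border op is appended to exactly one cluster, no operation can end up outside the pipeline stages, which is the property the border-based method relies on.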

Comment on lines 222 to 237
void runOnOperation() override {
  ModuleOp m = getOperation();
  OpBuilder builder(m);
  ModuleAllocation moduleAllocation(m);

  for (auto funcOp : m.getOps<mlir::triton::FuncOp>()) {
    Allocation *allocation = moduleAllocation.getFuncData(funcOp);
    funcOp.walk([&](scf::ForOp forOp) {
      if (auto totalStages =
              forOp->getAttr("triton.warp_pipeline.total_stages")) {
        Location loc = forOp.getLoc();
        emitPipelinedFor(builder, loc, forOp, allocation);
      }
    });
  }
}
Collaborator

Why can't the region be lowered by a normal pattern rewrite?

Contributor Author

That could be a better idea.

Contributor Author

done.

- Simplify discardable attr for marking pipeline
- Change to use pattern match to convert ops.
region is now inlined in the pass and no longer needed.
@antiagainst marked this pull request as ready for review on November 1, 2025 00:09